ABSTRACT 

Combinations of five methods of equating test forms 
and two methods of selecting samples of students for equating were 
compared for accuracy. The two sampling methods were representative 
sampling from the population and matching samples on the anchor test 
score. The equating methods were: (1) the Tucker method; (2) the 
Levine method; (3) the chained equipercentile method; (4) the 
frequency estimation method; and (5) an item response theory (IRT) method, 
specifically the three-parameter logistic model. The tests were the 
verbal and mathematics sections of the Scholastic Aptitude Test. The 
criteria for accuracy were measures of agreement with an 
equivalent-groups equating based on more than 115,000 students taking 
each form. Much of the inaccuracy in the equatings could be 
attributed to overall bias. The results for all equating methods in 
the matched samples were similar to those of the Tucker and frequency 
estimation methods in the representative samples; these equatings 
made too small an adjustment for the difference in the difficulty of 
the test forms. In the representative samples, the chained 
equipercentile method showed a much smaller bias. The IRT and Levine 
methods tended to agree with each other and were inconsistent in the 
direction of their bias. Five tables and four figures present study 
data. (Author/SLD) 
Abstract 



Combinations of five methods of equating test forms and two methods of 
selecting samples of students for equating were compared for accuracy. The 
two sampling methods were representative sampling from the population and 
matching samples on the anchor test score. The equating methods were the 
"Tucker", "Levine", "chained equipercentile", "frequency estimation", and 
IRT (3PL) methods. The tests were the verbal and mathematical sections of 
the Scholastic Aptitude Test. The criteria for accuracy were measures of 
agreement with an equivalent-groups equating based on more than 115,000 
students taking each form. 

Much of the inaccuracy in the equatings could be attributed to overall 
bias. The results for all equating methods in the matched samples were 
similar to those for the Tucker and frequency estimation methods in the 
representative samples; these equatings made too small an adjustment for the 
difference in the difficulty of the test forms. In the representative 
samples, the chained equipercentile method showed a much smaller bias. The 
IRT and Levine methods tended to agree with each other and were inconsistent 
in the direction of their bias. 



WHAT COMBINATION OF SAMPLING AND EQUATING METHODS WORKS BEST? 



Samuel A. Livingston, Neil J. Dorans, and Nancy K. Wright 
Educational Testing Service 

When a testing organization equates two forms of a test, the 
statisticians often have a choice of ways to select samples of student test 
papers to use in the equating. One possibility is simply to use all 
available test papers, but this choice may not always be the best choice. 
The statisticians also have a choice of methods to use in estimating the 
equating relationship between the two forms of the test. What combination of 
sampling and equating methods works best? 

The present study was an attempt to answer this question for a 
particular type of equating situation that is common in large-scale testing 
programs. In this situation a form of a test being given for the first time 
-- the "new form" -- is equated to a form of the test that was given 
previously but is no longer being given -- the "old form". The two forms are 
linked through an "anchor test" that was administered to students taking the 
new form and to students taking the old form. The anchor test may consist of 
a set of items contained in both forms, or it may be a separate test of 
similar abilities. The group of students taking the new form may or may not 
be similar in ability to the group that took the old form. 

This study is based on the assumption that there is a population of 
students for whom the equating is intended to be correct -- the "target 
population" -- and that it is possible to draw a sample of the students 
taking the new form in such a way that this "new form sample" will be 
representative of the target population. A second important assumption is 
that the "true" equating relationship -- the one to be estimated, as closely 
as possible, from the available data -- is the equipercentile relationship in 
the target population. This equating will be referred to as the "target 
equating". It is the equipercentile equating that would result if the entire 
target population took both forms of the test, with no practice or sequence 
effects. 

The Sampling and Equating Methods 

The present study was a comparison of several combinations of two 
sampling methods and five equating methods. Both sampling methods assume 
that the new form sample -- the sample of students taking the new form whose 
test papers are used in the equating process -- is representative of the 
target population. The two sampling methods differ in the way the "old form 
sample" is selected. The first sampling method simply chooses the old form 
sample randomly (or by a quasi-random procedure such as "spaced sampling") 
from the population of students who took the old form. This method will be 
referred to as "representative" sampling, although the resulting samples are 
only approximately representative of their parent populations. The second 
sampling method uses the anchor test score as a stratifying variable to match 
the old form sample to the new form sample. It guarantees that the old form 
sample and the new form sample will have the same distribution of scores on 
the anchor test. This method will be referred to as "matched" sampling. 
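The matched sampling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the study: the function name `matched_sample`, the `seed` parameter, and the rule for score levels with too few old-form students (take everyone available) are hypothetical choices made for the example.

```python
import random
from collections import Counter, defaultdict

def matched_sample(old_anchor, new_anchor, seed=0):
    """Pick old-form students so anchor-score counts mirror the new-form sample.

    Returns sorted indices into old_anchor. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    pools = defaultdict(list)            # anchor score -> old-form indices
    for i, score in enumerate(old_anchor):
        pools[score].append(i)
    chosen = []
    for score, wanted in Counter(new_anchor).items():
        pool = pools.get(score, [])
        # If the old-form group has too few students at this score level,
        # take them all; the match is then imperfect at this level.
        chosen.extend(rng.sample(pool, min(wanted, len(pool))))
    return sorted(chosen)
```

When every score level has enough old-form students, the selected sample has exactly the same anchor score distribution as the new form sample; otherwise the match is only approximate.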



The five different equating methods compared in the present study (see 
Dorans, 1989, for further details) include two linear methods, i.e., methods 
that assume that the equating relationship can be represented on a graph by a 
straight line. The other three methods are curvilinear methods; they make no 
such assumption. The first of the two linear methods will be referred to as 
the Tucker method (Angoff, 1984, pp. 109-112). It equates the estimated 
means and standard deviations of the scores that would have been observed if 
the students in the old form sample and those in the new form sample had 
taken both forms of the test. It is based on the assumption that the linear 
regression of new-form score on anchor-test score -- slope, intercept, and 
residual variance -- is the same in the old form sample (where it is 
unobserved) as in the new form sample (where it is observed). It makes a 
similar assumption for the regression of old-form score on anchor-test score. 
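A minimal sketch of a Tucker-style linear equating follows. It uses the standard textbook form of the Tucker estimates for a combined ("synthetic") group, which may differ in detail from the computations used in the study; the function name `tucker_linear` and the weight parameter `w_new` are assumptions made for this illustration.

```python
from statistics import fmean, pvariance

def _cov(a, b):
    """Population covariance of two paired score lists."""
    ma, mb = fmean(a), fmean(b)
    return fmean([(x - ma) * (y - mb) for x, y in zip(a, b)])

def tucker_linear(x_new, v_new, y_old, v_old, w_new=0.5):
    """Return (slope, intercept) converting new-form scores to the old-form scale.

    x_new, v_new: new-form and anchor scores in the new form sample.
    y_old, v_old: old-form and anchor scores in the old form sample.
    w_new: weight of the new form sample in the combined group (assumed 0.5).
    """
    w_old = 1.0 - w_new
    g_x = _cov(x_new, v_new) / pvariance(v_new)   # observed slope of X on V
    g_y = _cov(y_old, v_old) / pvariance(v_old)   # observed slope of Y on V
    d_mean = fmean(v_new) - fmean(v_old)          # anchor mean difference
    d_var = pvariance(v_new) - pvariance(v_old)   # anchor variance difference
    # Estimated means and variances of both forms in the combined group
    mu_x = fmean(x_new) - w_old * g_x * d_mean
    mu_y = fmean(y_old) + w_new * g_y * d_mean
    var_x = pvariance(x_new) - w_old * g_x**2 * d_var + w_new * w_old * (g_x * d_mean)**2
    var_y = pvariance(y_old) + w_new * g_y**2 * d_var + w_new * w_old * (g_y * d_mean)**2
    slope = (var_y / var_x) ** 0.5
    return slope, mu_y - slope * mu_x
```

When the two samples have identical anchor score distributions, `d_mean` and `d_var` are zero and the function reduces to a simple linear equating of the observed means and standard deviations.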



The second linear equating method will be referred to as the Levine 
method (Angoff, 1984, pp. 113-115). This method is similar to the Tucker 
method, except that the assumptions used to estimate the means and standard 
deviations in the combined sample are not the same as those for the Tucker 
method. The assumptions for the Levine method involve regressions of 
new-form and old-form true scores on anchor-test true scores. Also, the 
variance of errors of measurement on each form (rather than the residual 
variance in the regressions) is assumed to be the same in both samples. 

The third equating method is based on a procedure called "frequency 
estimation" (described in Angoff, 1984, p. 113). This procedure estimates, 
for each form, the joint distribution of scores on that form and the anchor 
test. This joint distribution is estimated for a group of students with a 
specified distribution of scores on the anchor test; we typically use the 
distribution in the combined (old form and new form) sample.[1] The key 
assumption is that the conditional distribution of scores on the new form, 
given the score on the anchor test, is the same in the old form sample (where 
it is unobserved) as in the new form sample (where it is observed). The 
method makes a similar assumption for the old form. Summing over scores on 
the anchor test yields estimated distributions of scores of the combined 
sample on the new form and on the old form. The third method included in 
this study was an equipercentile equating of these estimated distributions. 
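The reweighting step of frequency estimation can be sketched as follows. This is a simplified illustration, not the study's implementation: the function name `frequency_estimation` and the argument `target_anchor_dist` are hypothetical, and anchor scores with no observed data are simply skipped. The final equipercentile equating of the two estimated distributions is not shown.

```python
from collections import Counter, defaultdict

def frequency_estimation(form_scores, anchor_scores, target_anchor_dist):
    """Estimate a form's score distribution for a group with a given anchor distribution.

    form_scores, anchor_scores: paired scores in the sample that took the form.
    target_anchor_dist: dict {anchor score: proportion} for the target group.
    Returns dict {form score: estimated proportion}.
    """
    by_anchor = defaultdict(Counter)     # anchor score -> counts of form scores
    for x, v in zip(form_scores, anchor_scores):
        by_anchor[v][x] += 1
    estimated = defaultdict(float)
    for v, weight in target_anchor_dist.items():
        cond = by_anchor.get(v)
        if not cond:
            continue   # no data at this anchor score; its weight is dropped
        n_v = sum(cond.values())
        for x, count in cond.items():
            # P(x | v) observed in this sample, reweighted by the target P(v)
            estimated[x] += weight * count / n_v
    return dict(estimated)
```

Applying the function once per form, with the same target anchor distribution, yields the two estimated score distributions that are then equated.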

The fourth equating method (Angoff, 1984, p. 116) is the composite of 
two equipercentile equatings: an equating of the new form to the anchor test 
in the new form sample and an equating of the anchor test to the old form in 
the old form sample. Marco et al. (1983) referred to this method as the 
"direct equipercentile" method. We prefer the term "chained equipercentile" 
method, because it consists of two separate equipercentile equatings, linked 
by the anchor test. 
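The chaining of two equipercentile equatings can be illustrated with a simplified sketch. The version below assumes mid-percentile ranks and linear interpolation between observed score levels, with no smoothing; these choices and all function names are assumptions made for the example and are not taken from the study.

```python
from bisect import bisect_right
from collections import Counter

def _rank_table(sample):
    """Distinct scores and their mid-percentile ranks (0-100)."""
    counts = Counter(sample)
    scores = sorted(counts)
    n, below, ranks = len(sample), 0, []
    for s in scores:
        ranks.append(100.0 * (below + 0.5 * counts[s]) / n)
        below += counts[s]
    return scores, ranks

def _interp(x, xs, ys):
    """Piecewise-linear interpolation, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])

def equipercentile(from_sample, to_sample):
    """Map a 'from' score to the 'to' score with the same percentile rank."""
    f_scores, f_ranks = _rank_table(from_sample)
    t_scores, t_ranks = _rank_table(to_sample)
    return lambda score: _interp(_interp(score, f_scores, f_ranks), t_ranks, t_scores)

def chained_equipercentile(x_new, v_new, y_old, v_old):
    """New form -> anchor (new form sample), then anchor -> old form (old form sample)."""
    new_to_anchor = equipercentile(x_new, v_new)
    anchor_to_old = equipercentile(v_old, y_old)
    return lambda x: anchor_to_old(new_to_anchor(x))
```

Note that each link uses only one sample: the first equating never touches the old form sample, and the second never touches the new form sample.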



[1] This version of the method is consistent with Angoff's description. 
When only one of the samples is representative of the target population, it 
makes sense to use the anchor score distribution of that sample, rather than 
the combined sample. Nevertheless, in this study we have used the method as 
described by Angoff. 



The fifth method is based on item response theory (IRT), specifically 
the three-parameter logistic model (Lord, 1980). This method began with a 
simultaneous ("concurrent") calibration of all the items in both forms, using 
the computer program LOGIST (Wingersky et al., 1982). The estimated item 
parameters resulting from this calibration were used to estimate the expected 
scores on the two forms at several closely spaced levels of ability. The 
resulting (x,y) pairs define the equating transformation. For conciseness, 
this method will be referred to simply as "IRT". 
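The step from item parameters to (x,y) pairs can be sketched as below. The sketch assumes number-right expected scores, a scaling constant D = 1.7, and an evenly spaced ability grid from -4 to 4; these are illustrative assumptions, and the item parameters would in practice come from a calibration program, which is not reproduced here.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def expected_score(theta, items):
    """Expected number-right score: sum of item probabilities at ability theta."""
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)

def irt_score_pairs(new_items, old_items, lo=-4.0, hi=4.0, steps=81):
    """(expected new-form score, expected old-form score) at evenly spaced abilities.

    new_items, old_items: lists of (a, b, c) tuples on a common ability scale.
    """
    thetas = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return [(expected_score(t, new_items), expected_score(t, old_items)) for t in thetas]
```

Because both forms are calibrated on one ability scale, each ability level yields one expected score per form, and the resulting pairs trace out the conversion from new-form to old-form scores.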

When the new form and old form samples of students have identical score 
distributions on the anchor test, both linear methods described above become 
identical to a simple linear equating of the observed sample means and 
standard deviations. These methods use the anchor test scores to adjust for 
ability differences between the new form and old form samples. When the two 
samples have identical score distributions on the anchor test, there are no 
adjustments to be made. Similarly, in perfectly matched samples, the chained 
equipercentile method and the frequency estimation equipercentile method both 
become equivalent to a simple equipercentile equating of the observed 
distributions (except for small differences introduced in the interpolation 
procedure). Therefore, to the extent that perfect matching is possible, the 
five methods described above are reduced to three methods in the matched 
samples: linear equating, equipercentile equating, and IRT equating. 

The Data 

The tests equated in this study were the verbal and mathematical 
portions of two forms of the Scholastic Aptitude Test (SAT) that had been 
"spiraled" (i.e., administered in alternating sequence) in a regular SAT 
administration involving approximately a quarter of a million students. The 
equating of the SAT excludes students not in their junior or senior year of 
high school, and these students' papers were excluded from the equatings in 
this study. The remaining 236,000 students are the target population. The 
spiraling of test forms divided this population into groups of 119,000 and 
117,000 students. (One group is slightly larger because of the way the test 
booklets are packaged.) A sample of 117,000 or more students, sampled by 
spiraling test forms, can be assumed to represent very closely the ability 
distribution of the full population. Therefore, the equipercentile equating 
of the score distributions of these two groups of students can be assumed to 
be, for all practical purposes, the equating relationship in the target 
population. 

One of the two forms had been designated as the "new form" and the other 
as the "old form" in the equating that had been used to report scores on 
these two forms, and these designations were kept in conducting the study. 
No anchor test was necessary for equating these two forms of the SAT to each 
other. However, several anchor tests were administered with these two forms 
for the purpose of equating them to other (past and future) forms of the SAT. 
These anchor tests were "spiraled" in the population of test-takers, so that 
each combination of test form and anchor test was taken by a stratified 
random sample of approximately 8,000 students. Four of these anchor tests -- 
two verbal and two math -- were administered with both of the forms in this 
study. The correlation between the anchor score and the score to be equated 
varied from .86 to .88. 

The four anchor tests made it possible to create, artificially, several 
anchor equating situations in which the populations of students taking the 
old form differed systematically in ability from the populations taking the 
new form -- situations in which the true equating relationship in the target 
population was known. Each equating situation consisted of a pair of 
artificial pseudo-populations linked by an anchor test. The new form 
pseudo-population in each pair was simply the entire group of students taking 
the new form and the anchor test. Each old form pseudo-population was 
selected to be of systematically lower ability than the corresponding new 
form pseudo-population. The old form pseudo-population in each pair was 
selected from the students taking the old form and the anchor test, by 
removing a portion of the higher-ability students. The old form 
pseudo-populations for equating the verbal test were selected on the basis of 
their math scores, and vice versa, to avoid selecting on either the anchor 
score or the score to be equated. 

Each new form pseudo-population was paired with two different old form 
pseudo-populations of different ability levels. One of the old form 
pseudo-populations was selected to have a mean ability level approximately 0.2 
standard deviations lower than the new form pseudo-population. This 
pseudo-population will be referred to as the "0.2 population". The other 
pseudo-population was selected to have a mean ability level approximately 0.4 
standard deviations lower than the new form pseudo-population. This 
pseudo-population will be referred to as the "0.4 population".[2] The "0.2 
populations" varied in size from 6148 to 6658 students; the "0.4 populations" 
varied in size from 4367 to 4887 students. 

Although the new form and (particularly) the old form pseudo-populations 
in this study are artificial, they are made up of real students. The data 
are not simulated data. They are the actual test responses of real students, 
sitting in testing rooms with their Number 2 pencils, trying to get into the 
colleges of their choice. 

Samples for Equating 

In every equating in this study, the new form sample was a 
representative sample of the new form pseudo-population. These samples were 
drawn by a spaced random sampling procedure: dividing the data file into 
equal-sized blocks of a specified size and selecting a specified number of 
students randomly from each block. The "representative" old form samples 
were drawn in the same way from their respective old form pseudo-populations. 
All the representative samples were drawn to include approximately 3000 
students. 

[2] The correlation between verbal and math scores is approximately .70. 
A "0.2 population" for equating verbal scores was selected by specifying a 
distribution of math scores that had a mean (0.2/.70) standard deviations 
below that of the full old form population. The resulting "population" had a 
mean verbal score approximately 0.2 standard deviations below that of the 
full population. A similar procedure was used for selecting the other old 
form "populations". 
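The arithmetic behind the (0.2/.70) figure in footnote 2 can be checked with a one-line calculation. It assumes, as an illustration only, a linear regression of standardized verbal score on standardized math score with correlation $\rho = .70$; the symbols $\delta_M$ and $\delta_V$ (the shifts in mean math and mean verbal score, in standard deviation units) are introduced here and do not appear in the original.

```latex
\delta_V \approx \rho\,\delta_M = .70 \times \frac{0.2}{.70} = 0.2
```

Under the same assumption, a target verbal shift of 0.4 standard deviations would call for a math shift of about $0.4/.70 \approx 0.57$ standard deviations.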

The "matched" old form samples were to be drawn from their respective 
populations by using the anchor test as a stratifying variable and randomly 
selecting the same number of students at each score level as were in the new 
form sample. However, it was necessary to modify this plan, because there 
were not enough high-ability students left in the old form pseudo- 
populations, particularly the "0.4 populations". In selecting matched 
samples from the "0.2 populations" it was possible to select samples with 
anchor score distributions very similar to (though not exactly the same as) 
those of the new form samples. In selecting matched samples from the "0.4 
populations" it was necessary to reduce the sample size proportionally, from 
3000 to approximately 1500, and even then the samples were not perfectly 
matched on the anchor test score. 

The design of the study therefore involved a total of 16 pairs of 
samples for equating. The design is illustrated in Figure 1. Table A1, in 
the appendix, summarizes the characteristics of the equating samples: the 
type of sample (representative or matched and, for the old form samples, 
which population the sample was selected from), the number of students in the 
sample, and their mean score on the anchor test. If the "matched" old form 
samples were perfectly matched to the new form samples, their mean anchor 
scores would be exactly equal to those of the corresponding new form samples. 
As Table A1 shows, they were not. 



Criteria for Accuracy 

The main criterion for judging the overall accuracy of each equating was 
the root-mean-squared deviation (RMSD) of the equated scores of the full new 
form population from their equated scores determined by the target equating. 
The RMSD is computed by the formula 

$$\mathrm{RMSD} = \sqrt{ \frac{\sum_x n(x)\,[\hat{y}(x) - y(x)]^2}{\sum_x n(x)} } ,$$ 

where $n(x)$ is the number of examinees (in this case, the number of juniors 
and seniors in the full population) with raw score $x$ on the new form, $y(x)$ 
is the corresponding exact scaled score on the old form as determined by the 
criterion equating, and $\hat{y}(x)$ is the corresponding exact scaled score 
on the old form as determined by the other equating to be compared with the 
criterion equating. The summation is over the raw score levels on the new 
form. The equated scores are expressed on the College Board 200-to-800 
scale, and the RMSD statistics are in terms of this scale. Since the scores 
are reported in ten-point intervals, an RMSD statistic of 3.3 for an equating 
can be interpreted to mean that the equated scores of the new form population 
are, on the average, about one-third of a score level away from what they 
should have been. The standard deviation of scaled scores in the full 
test-taker population, for both the verbal and math scores, is about 100 
points; adjacent score levels (e.g., 450 and 460) differ from each other by 
about one-tenth of a standard deviation. We consider an RMSD statistic of 5 or 
more as an indication of a problem equating. 

A secondary criterion for judging the accuracy of each equating was its 
bias -- its tendency to produce equated scores that were systematically too 
high or too low. The overall bias statistic is an average value for the new 
form population. The bias of the equating is computed by the formula 

$$\mathrm{Bias} = \frac{\sum_x n(x)\,[\hat{y}(x) - y(x)]}{\sum_x n(x)} ,$$ 

where the symbols have the same meaning as in the formula for the RMSD. 
Note that negative bias in one portion of the score range will cancel out 
positive bias in another portion of the score range. Therefore, the bias 
statistic is not a good basis for evaluating an equating unless all the 
equated scores are too high or all are too low. However, the bias statistic 
is always valuable as a diagnostic tool for investigating the reason for a 
large RMSD statistic. 
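The two accuracy statistics can be computed together, as in the following minimal sketch. The function name `rmsd_and_bias` and its argument names are hypothetical; the three inputs are parallel lists indexed by new-form raw score level, and the difference is taken as the equating being evaluated minus the criterion equating.

```python
import math

def rmsd_and_bias(counts, estimated, criterion):
    """Weighted RMSD and bias of an equating against a criterion equating.

    counts[i]: number of examinees at raw score level i on the new form.
    estimated[i]: scaled score at level i from the equating being evaluated.
    criterion[i]: scaled score at level i from the criterion equating.
    """
    total = sum(counts)
    diffs = [e - c for e, c in zip(estimated, criterion)]
    bias = sum(n * d for n, d in zip(counts, diffs)) / total
    rmsd = math.sqrt(sum(n * d * d for n, d in zip(counts, diffs)) / total)
    return rmsd, bias
```

For example, differences of +10, 0, and -10 scaled-score points at three score levels with counts 1, 2, and 1 give a bias of 0 but an RMSD of about 7.07, which shows how errors in opposite directions cancel in the bias statistic but not in the RMSD.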



Results 

Table 1 shows the mean and standard deviation, in the new form 
population, of the scaled scores that would result from each of the 
equatings. Table 2 shows the bias and RMSD of the scaled scores resulting 
from each of the equatings. These statistics are expressed in the units of 
the SAT 200-to-800 scale. The target equating is the equipercentile equating 
in the full population. The information in Table 2 is presented graphically 
in Figures 2a to 4d. 

Figures 2a to 2d compare the accuracy of eight combinations of sampling 
and equating methods in the four equating situations using the "0.4 
population" as the old form population. Each of these four plots contains 
eight data points; each data point represents a different combination of 
sampling and equating methods. The accuracy of the equating, as indicated by 
the RMSD statistic, is represented by the diagonal distance from the data 
point to the origin. The horizontal component of this distance represents 
the bias in the equating. The vertical component has no simple 
interpretation; it represents all the other factors (i.e., other than a 
constant bias) that contribute to the RMSD. 

All four plots in Figures 2a to 2d show a similar clustering of the data 
points. At the left of the graph, indicating a large negative bias, are the 
data points for the Tucker method and the frequency estimation method in the 
representative samples and for all three methods (linear, equipercentile, 
IRT) in the matched samples. Toward the center of each figure, indicating 
less negative bias, is the data point for the chained equipercentile method 
in the representative samples. At the right of each figure, indicating the 
least negative bias or, in two cases, a positive bias, are the data points 
for the Levine method and the IRT method in the representative samples. 
Figures 3a to 3d show the same information as Figures 2a to 2d, but for 
the equatings in which the old form samples were drawn from the "0.2 
populations". The data points in these four plots are all closer to the 
origin, indicating more accurate equatings. They tend to show the same 
general pattern as those in Figures 2a to 2d, but there are some differences. 
Figure 3b shows the matched-sample methods clustering separately from the 
Tucker and frequency estimation methods in the representative samples. The 
reason for this separation appears to be that, because of sampling 
variability, the relationship between the anchor test scores and the full 
verbal scores was not the same in the representative sample as in the matched 
sample.[3] 

There is one result that appears consistently in the equatings of the 
math scores but not in the equatings of the verbal scores: although all 
methods in the matched samples show a consistent negative bias, the IRT 
method shows somewhat less bias than the other methods. In the equatings of 
the verbal scores, the IRT equating in the matched samples generally shows a 
smaller RMSD than the other matched-sample methods, but about the same degree 
of bias. 

Figures 4a to 4d show the bias and RMSD statistics for equatings done by 
each method in the full subpopulations, i.e., all students taking each 
combination of test form and anchor test. These statistics represent the 
best results that can reasonably be expected from each method. Any bias in 
these equatings is the result of sampling variability in the full-test and 
anchor-test scores of the subpopulations. These figures show the two linear 
methods clustering closely together with the two equipercentile methods, 
while the IRT method tends to produce somewhat different results, especially 
for the equatings of the math scores. 

Note that none of the anchor equatings in Figures 4a to 4d exactly 
reproduces the target equating. Also note that all four sets of equatings 
show some degree of bias, especially the equatings of the verbal scores 
through anchor "vb". This bias is a result of sampling variability in the 
full-test and anchor scores. For example, the subpopulation taking the old 
form and anchor test "vb" did particularly well on the full test, averaging 
about 0.37 raw-score points (.024 SD) better than the group of all students 
taking the old form. The subpopulation taking the new form and anchor "vb" 
averaged only 0.06 raw-score points (.004 SD) better than the group of all 
students taking the new form. Yet, this difference was not reflected in the 
anchor scores; the anchor score means of these two groups differed by only 
.002 SD. As a result, all the equatings in these two subpopulations were 
biased in a positive direction. For this reason it may be useful to 
recompute the bias and RMSD statistics for each equating, with the 
corresponding subpopulation equating as the target equating. 

[3] For the equating results shown in Figure 3b, the mean difference 
between the anchor scores of the matched old-form sample and the 
representative old-form sample was .18 SD, while the mean difference in their 
full verbal scores was only .12 SD. In Figure 3a, which showed no such 
separation between the matched-sample results and the Tucker and frequency 
estimation results, the corresponding mean differences were .19 SD for the 
anchor test scores and .17 SD for the full verbal scores. 

Table 3 shows this comparison. Each of the bias and RMSD statistics in 
Table 3 compares the sample equating with its own target equating -- one that 
uses the same equating method and the same anchor test. Each of these 
special target equatings was determined in the subpopulation of all students 
taking the particular anchor test. That is, the subpopulation equatings of 
Figures 4a to 4d become the target equatings in Table 3. However, the bias 
and RMSD statistics for each of these comparisons are computed for the full 
population, not the subpopulation. These results show the same consistent 
bias for the Tucker and frequency estimation methods and for all methods 
applied in the samples matched on the anchor test. The results for the other 
methods are less clear. In the representative samples, the chained 
equipercentile method consistently produced equated scores that were lower 
than those produced by the Levine and IRT methods, even when each method is 
compared against a target equating by the same method. It is difficult to 
say whether this difference reflects a negative bias in the chained 
equipercentile method, a positive bias in the Levine and IRT methods, or 
both. 



A Partial Explanation 

The clustering of the methods in the results of this study is not a 
coincidence. The matched-sample methods tended to cluster together because 
they all use the information contained in the anchor scores in the same way: 
to create matched samples of test-takers. Once the samples have been 
selected to have identical anchor score distributions, there is no relevant 
information remaining to be extracted from the anchor scores. Therefore, all 
methods tend to produce similar results in the matched samples. 

The Tucker and frequency estimation methods tend to cluster with the 
matched-sample methods because they use the anchor score in a similar way: as 
a conditioning variable. Both these methods assume, for equating purposes, 
that if two groups of test-takers have identical score distributions on the 
anchor test, they should have identical distributions of equated scores. 
Putting it another way, students with the same anchor score are assumed to be 
exchangeable between the old form and new form populations. If either of 
these methods is applied to samples with identical anchor score 
distributions, the method assumes, in effect, that the samples are of equal 
ability and simply equates the observed percentiles or the observed means and 
standard deviations. These methods can be understood as attempts to estimate 
the equating relationship in samples matched on the anchor test. 
Consequently, their results tend to agree with the results of equatings 
actually performed in samples matched on the anchor test. 

This explanation accounts for the tendency of these methods to show a 
negative bias in this study. The old form pseudo-populations in this study 
were systematically less able than the new form population. The anchor score 
is an imperfect measure of the abilities measured by the old form and new 
form scores. When samples from different populations are matched on the 
anchor test, there is a "regression effect", so that matching on the anchor 
score does not completely remove the ability difference between the samples. 
If the old form population was less able than the new form population, the 
old form sample will still be somewhat less able than the new form sample, 
even though their anchor scores do not show any difference. Since the 
students in the old form sample are less able than they would be if they were 
truly matched on ability, they tend to earn lower scores on the old form 
(i.e., lower than they would if they were truly as able as the new form 
sample). Therefore, the old form appears relatively harder, the new form 
appears relatively easier, and the equating does not award a high enough 
equated score for a given raw score on the new form. As a result, the 
equating is biased in a downward direction. (Ironically, in the equatings 
shown in Figure 3b, the inconsistency between the full-test and anchor-test 
scores of the "representative" old-form sample tended to counteract this 
general tendency. Consequently, the results of the Tucker and frequency 
estimation methods in the representative samples were free of bias in this 
one case.) 
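The size of the regression effect can be illustrated with a simple calculation. The sketch below rests on assumptions that are not part of the study: a linear regression of test score on anchor score with the same slope in both groups, and groups that differ by the same number of standard deviations on the test and on the anchor. The function name and its parameters are hypothetical.

```python
def residual_gap_after_matching(ability_gap_sd, anchor_corr):
    """Expected test-score gap (in SD units) remaining after matching on the anchor.

    ability_gap_sd: gap between group means, assumed equal (in SD units)
        on the test score and on the anchor score.
    anchor_corr: correlation between anchor score and test score.
    Within each group, E[score | anchor] regresses toward that group's own
    mean, so students with equal anchor scores still differ by gap * (1 - r).
    """
    return ability_gap_sd * (1.0 - anchor_corr)
```

Under these assumptions, a gap of 0.4 standard deviations and an anchor correlation of .87 would leave a gap of roughly 0.05 standard deviations after matching. This is an illustrative figure only, not a result reported in the study.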

It is difficult to explain fully why the Levine method and the IRT 
method tended to agree so consistently in the representative samples. A 
comparison of the actual conversion lines also showed these two methods 
tending to deviate from the target equating in the same direction in the same 
parts of the score scale. The two methods are similar in that they both assume 
exchangeability (across old-form and new-form samples) for students with the 
same true score, rather than for students with the same observed anchor 
score. This assumption seems to produce a general agreement in the results 
of these two methods, even though they differ in their other assumptions, in 
the type of data they use (item-level or score-level), and even in their 
definition of equating! 

It is not surprising that the chained equipercentile method did not 
cluster with any of the other methods. This method is based on an entirely 
different logic from the other methods; it uses the information in the anchor 
scores in a different way. It does not estimate old-form and new-form 
distributions on the basis of assumptions about the exchangeability of 
students between old and new form samples. Instead, it simply equates the 
new form to the anchor in the new-form sample and the old form to the anchor 
in the old-form sample. The implicit assumption is that the equating 
relationship between each form and the anchor test is the same in the group 
where it is unobserved as it is in the group where it is observed. That is, 
the chained equipercentile method assumes that equating relationships are 
stable across populations of test-takers. This method tended to produce 
surprisingly good results, particularly in its lack of bias. The weakness of 
this method was that it tended to produce score conversion lines that 
fluctuated greatly around the target equating. Therefore, it seems 
reasonable to expect that the addition of a smoothing step would 
substantially reduce the RMSD for this method, possibly making it the best of 
the methods tested in this study. 



Limitations of the Study 
In any study such as this one, the generality of the findings is 
subject to question. While our results are similar to those of Marco et al. 
(1983), we cannot be sure that they would generalize to all tests equated 
through an anchor. Although this study represented a considerable amount of 
effort by several people, its scope is limited. It involved only two tests 
and only one population of test-takers. Each of the pseudo-populations was 
only one of many that could have been selected by the specified procedure. 
Each equating sample was only one of many that could have been drawn from its 
parent pseudo-population. Would we have found the same results in equating a 
test through an internal anchor? What if the populations had been selected 
on a variable that correlated much more or much less strongly than .70 with 
the scores to be equated? 

The subpopulations of students taking the different combinations of test 
form and anchor test were not perfectly representative of the full 
population. Their full-test scores and anchor scores showed differences that 
would not have occurred if the subpopulations had been perfectly 
representative of the full population. Differences such as these might be 
expected because of sampling variability. Nevertheless, they tended to 
introduce bias into the equatings. As a result of these differences, even 
the equatings in the subpopulations were not free of bias. The amount and 
direction of bias varied, as might be expected, from one pair of 
subpopulations to another. 

A different type of limitation of the study is the way in which the 
pseudo-populations were created. The method we used -- selecting on a 
variable correlated .70 with the scores to be equated -- may or may not 
correspond to the way ability differences between old-form and new-form 
populations arise in the real world of educational testing. 



Implications for Research and Practice 

One promising area for further investigation concerns the many equating 
methods and variations of methods that we did not investigate. How much 
better would the chained equipercentile method have performed if we had 
smoothed the distributions before equating? What would we have found if we 
had used a different IRT parameter estimation method? 

Another promising area for investigation concerns other possible 
sampling methods. The anchor test is not the only possible variable to use 
for matching equating samples. Would matching samples on some other variable 
produce better equating results than simply choosing samples that are 
representative of their respective populations? The Levine and IRT equating 
methods assume that the best matching variable to use, if it were possible to 
do so, would be the student's true score. However, it is not possible to 
select students on the basis of their true scores. Ideally, the right 
variable to use as a basis for matched sampling would be a measure of 
whatever is causing the old-form and new-form populations to differ 
systematically in ability. In practice, we cannot even know what this 
variable is; we certainly cannot measure it. However, we can do the next 
best thing. Applying a statistical technique originally developed for 
medical research, we can select on a "propensity score" (Rosenbaum and Rubin, 
1985). A propensity score is similar to a discriminant function; it is the 
linear combination of all the variables we can measure that best 
discriminates between the two populations, in this case, the old-form and 
new-form populations. How accurately could we equate test forms if we 
selected samples by matching on a propensity score? 
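
To make the idea concrete, the following Python sketch builds a discriminant-style score and matches samples on it. Everything in it is an assumption made for illustration: the score is Fisher's linear discriminant (used here as a stand-in for a propensity score, which is often estimated by other means), each student is described by just two hypothetical background variables, and the matching rule is a simple greedy nearest-neighbor search without replacement.

```python
from statistics import mean


def fisher_weights(group_a, group_b):
    """Weights of the linear combination of two variables that best separates
    two groups (Fisher's linear discriminant, pooled within-group scatter)."""
    def summarize(rows):
        m0 = mean(r[0] for r in rows)
        m1 = mean(r[1] for r in rows)
        s00 = sum((r[0] - m0) ** 2 for r in rows)
        s01 = sum((r[0] - m0) * (r[1] - m1) for r in rows)
        s11 = sum((r[1] - m1) ** 2 for r in rows)
        return (m0, m1), (s00, s01, s11)

    (a0, a1), sa = summarize(group_a)
    (b0, b1), sb = summarize(group_b)
    s00, s01, s11 = (sa[i] + sb[i] for i in range(3))
    det = s00 * s11 - s01 * s01          # assumes the variables are not collinear
    d0, d1 = a0 - b0, a1 - b1
    # w = S^{-1} (mean_a - mean_b), written out for the 2x2 case
    return ((s11 * d0 - s01 * d1) / det, (-s01 * d0 + s00 * d1) / det)


def discriminant_score(weights, row):
    """Linear combination of a student's two background variables."""
    return weights[0] * row[0] + weights[1] * row[1]


def match_on_score(target_scores, pool_scores):
    """For each target score, pick the index of the closest unused pool score.
    Requires the pool to be at least as large as the target sample."""
    unused = set(range(len(pool_scores)))
    chosen = []
    for t in target_scores:
        best = min(unused, key=lambda i: (abs(pool_scores[i] - t), i))
        unused.remove(best)
        chosen.append(best)
    return chosen
```

In use, one would compute the weights from the new-form sample and the old-form pseudo-population, score every student, and pass the two score lists to `match_on_score` to obtain the indices of the matched old-form sample.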

In addition to these practical research questions, there are other 
types of studies that could improve our understanding of the effects of 
sampling on equating results. Suppose we repeated this study with samples 
matched on the variable originally used to construct the pseudo-populations 
(i.e., math scores for equating the verbal test, and vice versa). Would the 
equating results be unbiased? If we used this variable as an anchor in the 
Tucker and frequency estimation equatings, would the results be unbiased? If 
we constructed new pseudo-populations by selecting on the anchor test itself, 
would matching on the anchor score produce unbiased equatings? 

Although additional research is desirable, those who have the 
responsibility for equating test scores cannot wait for additional results 
before deciding which sampling and equating methods to use. What guidance, 
if any, do the results of this study offer them in making these decisions? 
When populations differ in ability, matching samples on the anchor test does 
little to enhance the accuracy of the equating. It makes little difference 
in the results of the Tucker and frequency estimation methods, and it tends 
to make the results of any other equating methods very similar to those of 
the Tucker and frequency estimation methods. A better way to deal with the 
ability difference would be to select a representative sample from each 
population and use an equating method that does not assume exchangeability 
for test-takers classified by their anchor test scores. An alternative would 
be to select samples by matching on something other than the anchor test, 
possibly combining several such variables into a single "propensity score". 

If matching on a propensity score is the right thing to do, why even 
bother to study the effects of matching on the anchor test score? The reason 
is that matching on a propensity score is a much more laborious process than 
matching on the anchor score. If matching on the anchor test would produce 
good results, the more involved procedure of matching on a propensity score 
would be superfluous. In the present study, matching on the anchor test did 
not produce good results. It appears that selecting equating samples by 
matching on a propensity score may be a technique worthy of further 
investigation. 

Table 1. Mean equated scores (on SAT 200-to-800 scale) of full new-form population 
using raw-to-scale score conversion from each equating. 

Sampling method:           Matched on Anchor                 Representative
Equating method:          Linear   Equi%     IRT  Tucker   Freq. Chained  Levine     IRT
                                                            est.   equi%
Old form  Anchor
sample    test

0.4       va      mean       416     415     416     415     416     420     423     423
                  SD         103     103     105       ?     102     104     106     107
0.2       va      mean       421     42?     421     421     421     423     42?     424
                  SD         104     1?4     106       ?     104     105     106     107
0.0       va      mean                               424     424     423     424     424
                  SD                                 105       ?     105     105     106

0.4       vb      mean       420     42?     420     41?     420     425     428     428
                  SD         103     10?     1?5     102     103     106     108     109
0.2       vb      mean       422     422     422     425     42?     427     428     429
                  SD           ?     104     105     102     1?2     104     105     106
0.0       vb      mean                               42?     426     42?     42?     427
                  SD                                 10?     105     105     105     106

(Target equating for verbal scores: mean = 425; SD = 105)

0.4       ma      mean       464     464     465     465     466     4??     476     4??
                  SD           ?       ?     11?       ?       ?       ?     121     121
0.2       ma      mean         ?     471     4??     470     470     473     474     476
                  SD           ?       ?     11?     112     112       ?     120     121
0.0       ma      mean                               476     476     476     476     478
                  SD                                   ?     112     117     117     119

0.4       mb      mean       467     468     470     468     468     476       ?     481
                  SD           ?     11?     121     120     112     120     122     123
0.2       mb      mean       4??     4??     474     4??     4??     476     478     479
                  SD         120     111     121     112     11?     119     122     122
0.0       mb      mean                                 ?     476     476     476     4??
                  SD                                 11?     118     117     11?     120

(Target equating for math scores: mean = 477; SD = 118)



Table 2. Bias and root-mean-squared difference (RMSD) of equated scores 
(on SAT 200-to-800 scale) produced by each equating from those 
produced by equipercentile equating in the full population. 

Sampling method:           Matched on Anchor                 Representative
Equating method:          Linear   Equi%     IRT  Tucker   Freq. Chained  Levine     IRT
                                                            est.   equi%
Old form  Anchor
sample    test

0.4       va      bias      -9.1    -9.2    -8.3    -?.6    -9.0       ?    -1.7    -1.5
                  RMSD         ?     ?.8     8.4     ?.5     9.9     6.3     2.8     3.0
0.2       va      bias      -3.?    -3.5    -3.4    -3.8       ?    -1.3    -0.2    -0.8
                  RMSD       4.4     4.8     3.7     4.7     4.4     4.4     2.3     2.6
0.0       va      bias                              -?.0    -0.9    -1.1       ?       ?
                  RMSD                               2.5     1.8     2.7     2.5     2.0

0.4       vb      bias      -4.6    -4.7    -4.8    -5.8    -5.0    +0.9    +3.6    +3.4
                  RMSD       5.5     5.7     5.2     6.8       ?     3.4     5.3     ?.8
0.2       vb      bias      -2.7    -2.7    -2.2    +0.3    +0.5    +?.?    +3.7    +3.9
                  RMSD       3.5     3.3     3.0     3.5     3.1     4.4     4.3     4.3
0.0       vb      bias                              +2.0    +1.8    +1.9    +1.9    +2.3
                  RMSD                               3.9     2.5     2.8     3.0     2.8

0.4       ma      bias     -1?.9   -12.8   -11.6   -11.?   -11.1    -4.3    -0.5    +0.2
                  RMSD      13.1    13.7       ?    12.?    11.7     5.?     3.6     3.8
0.2       ma      bias      -5.8    -5.7    -4.8    -6.9    -6.5    -3.8    -2.2    -0.8
                  RMSD       6.2     6.4     ?.3     7.1     6.9     4.5     3.8     3.9
0.0       ma      bias                              -0.8    -0.8    -0.2    -0.2    +1.5
                  RMSD                               1.9     1.8     1.9     1.8     2.4

0.4       mb      bias         ?    -8.9    -6.8    -9.0    -8.7    -1.3    +4.4    +4.5
                  RMSD       9.4     9.9     8.2     9.4     9.7     6.0    10.4     7.5
0.2       mb      bias      -5.?    -5.0       ?    -4.5    -4.4    -0.6    +1.7    +2.9
                  RMSD       5.7     5.9     4.4       ?     5.5     2.7     4.9       ?
0.0       mb      bias                              -1.3    -1.3    -1.0    -1.1    +0.7
                  RMSD                               2.1     1.9     1.5     1.9     3.2

Table 3. Bias and root-mean-squared difference (RMSD), in the full population, of equated scores 
(on SAT 200-to-800 scale) produced by each equating from those produced by an equating using the 
same method applied in the subpopulation of all students taking the anchor test. 

Sampling method:           Matched on Anchor                 Representative
Equating method:          Linear   Equi%     IRT  Tucker   Freq. Chained  Levine     IRT
                                                            est.   equi%
Old form  Anchor
sample    test

0.4       va      bias      -8.1    -9.1    -7.2       ?    -9.1    -3.0    -0.7    -0.4
                  RMSD         ?     9.9       ?       ?     9.0     5.1     0.8     1.4
0.2       va      bias         ?    -2.4    -2.3    -2.8    -2.4    -0.2    +0.8    +0.3
                  RMSD       3.1     4.0     2.6       ?     3.4       ?     0.9     1.3

0.4       vb      bias      -6.6    -6.7    -7.1    -7.7    -6.9    -1.1    +1.6       ?
                  RMSD       7.0     7.9     7.4     9.4       ?     2.6     3.4       ?
0.2       vb      bias         ?    -4.6    -4.5    -1.7    -1.3    +0.9    +1.8    +1.6
                  RMSD       4.9     5.0     4.7     3.5     3.6     2.6     1.8     1.7

0.4       ma      bias     -12.2   -12.6       ?       ?   -10.3    -4.1    -0.4    -1.4
                  RMSD      12.?    13.5    13.3       ?    1?.?     5.3     4.1     3.0
0.2       ma      bias      -5.0    -5.5    -6.4       ?    -5.7    -3.5    -2.0    -2.4
                  RMSD       5.1     6.2     6.5       ?     6.2     4.7     4.1     3.5

0.4       mb      bias      -7.9    -8.0    -7.5    -7.7    -7.5    -0.?    +5.5    +3.8
                  RMSD       7.?     ?.0     7.7     7.9     8.4     6.2    10.9     6.1
0.2       mb      bias      -3.8       ?    -3.4    -3.1    -3.1    +0.3    +2.8    +2.2
                  RMSD       4.0     5.1     3.8     3.2     4.1     2.7     5.2     3.3

APPENDIX 

Table A1. 

Characteristics of equating samples: 
Type of sample, number of students, mean anchor test score 

                 New Form Sample                Old Form Sample
Test, anchor     Type        n  anchor mean     Type               n  anchor mean

Verbal, va       rep.     3007        19.11     0.4 matched*    1507        19.15
                                                0.4 rep.        2998        15.77
                                                0.2 matched*    3006        19.12
                                                0.2 rep.        2997        17.45
                 rep.**   7959        19.16     0.0 rep.**      3512        19.17

Verbal, vb       rep.     3004        18.22     0.4 matched*    1525        18.36
                                                0.4 rep.        2998        14.80
                                                0.2 matched*       ?            ?
                                                0.2 rep.        2999        16.69
                 rep.**   7625        18.21     0.0 rep.**      8329        18.23

Math, ma         rep.     2999        11.36     0.4 matched*    1498        11.36
                                                0.4 rep.        3002         8.96
                                                0.2 matched*    2999        11.36
                                                0.2 rep.        3000        10.29
                 rep.**   8450        11.45     0.0 rep.**      8000        11.28

Math, mb         rep.     2998        10.26     0.4 matched*    1480        10.20
                                                0.4 rep.        3003         7.69
                                                0.2 matched*    2998        10.26
                                                0.2 rep.        2999         8.91
                 rep.**   8161        10.20     0.0 rep.**      7764        10.12

* Imperfectly matched 

** Subpopulation created by spiraling of test forms 

Figure 1. Design of the Study 

Full population: 238,000 students 

    Old form subpopulation: 119,000 students 
    New form subpopulation: 117,000 students 
    (randomly equivalent) 

    Anchor test subpopulations (old form and new form): 8,000 students each, 
    all randomly equivalent 
        Verbal: va, vb 
        Math:   ma, mb 

    Old form pseudo-populations: 4,000 to 7,000 students 
        va 0.4, va 0.2 
        vb 0.4, vb 0.2 
        ma 0.4, ma 0.2 
        mb 0.4, mb 0.2 

    Old form equating samples: 1,500 or 3,000 students each 
        Representative: va 0.4, va 0.2, vb 0.4, vb 0.2, ma 0.4, ma 0.2, mb 0.4, mb 0.2 
        Matched:        va 0.4, va 0.2, vb 0.4, vb 0.2, ma 0.4, ma 0.2, mb 0.4, mb 0.2 

    New form equating samples: 3,000 students each 
        Representative: va, vb, ma, mb 

Figure 2a. Bias and RMSD in equating the verbal scores through anchor "va", 
sampling from the "0.4 population". 

Figure 2b. Bias and RMSD in equating the verbal scores through anchor "vb", 
sampling from the "0.4 population". 

Figure 2c. Bias and RMSD in equating the math scores through anchor "ma", 
sampling from the "0.4 population". 

Figure 2d. Bias and RMSD in equating the math scores through anchor "mb", 
sampling from the "0.4 population". 

T - Tucker 
L - Levine 
C - Chained equipercentile 
F - Frequency estimation 
I - IRT 

M - Matched Linear 
E - Matched Equipercentile 
X - Matched IRT 

Figure 3a. Bias and RMSD in equating the verbal scores through anchor "va", 
sampling from the "0.2 population". 

Figure 3b. Bias and RMSD in equating the verbal scores through anchor "vb", 
sampling from the "0.2 population". 

Figure 3d. Bias and RMSD in equating the math scores through anchor "mb", 
sampling from the "0.2 population". 

Figure 4a. Bias and RMSD in equating the verbal scores through anchor "va" 
in the anchor test subpopulations. 

Figure 4b. Bias and RMSD in equating the verbal scores through anchor "vb" 
in the anchor test subpopulations. 

Figure 4c. Bias and RMSD in equating the math scores through anchor "ma" 
in the anchor test subpopulations. 